1. INTRODUCTION

Due to the rapid growth of internet applications and services, a vast amount of information is 
generated daily. This results in a phenomenon called information overload, making it challenging for users to 
find relevant information [1]. To address this problem, researchers have developed recommender systems 
(RSs), which filter out irrelevant services and items and provide personalized recommendations to users based 
on their historical preferences [2]. RSs can also incorporate additional information such as user profiles or item 
features [3]. They are mainly classified into three categories: collaborative filtering (CF), content-based (CB), 
and hybrid filtering [4]. CF is the simplest and most efficient method among them, and it has been used in many
real-world systems such as Amazon. CF is divided into user-based and item-based filtering depending on the
adopted prediction technique [5]. 

Clustering is an unsupervised learning technique that is widely used in recommender systems to group
users or items based on their similarity. Each group is called a cluster, and the members of a cluster are
highly similar with respect to given data properties [6]. Despite being the most commonly used clustering
method, k-means has several known disadvantages, such as its sensitivity to the initial centroids and to the
initial parameter settings [7]. Nature-inspired algorithms have been shown to be more effective in overcoming
these weaknesses: their swarm behavior enables them to search for optimal solutions in a cooperative and
organized manner, and they have demonstrated superiority over traditional clustering techniques [8]. Particle
swarm optimization (PSO) is a well-known nature-inspired algorithm. It was introduced in 1995 by Kennedy and
Eberhart and uses a population-based, probabilistic approach to solve optimization problems [9]. Other
nature-inspired algorithms include the artificial fish swarm algorithm (AFSA) and ant colony optimization
(ACO). AFSA was developed based on the social behavior of fish in swarms [10], while ACO was inspired by the
foraging behavior of ants [11].

Deep learning has achieved great success in several application fields such as speech recognition,
computer vision, and natural language processing [12]. Academia and industry are increasingly interested in
applying deep learning to different applications because it can solve many problems and achieve high-quality
results [13]. Recently, deep learning-based recommender systems have been actively investigated [14],
where user and item features are combined (averaged or concatenated) and passed through several perceptron
layers to make predictions. Deep structured models [15] look into users’ textual behaviors (search queries and
browsing histories) and textual content, then map the users and items into a latent representation where the
similarity among the users’ preferences is maximized. In recent years, several research works have tried to
combine collaborative filtering and deep learning into collaborative deep learning-based recommendation
[16]-[19]. The autoencoder is an approach recently introduced into recommender systems, where non-linear
matrix factorization is computed by the autoencoder framework from user-item ratings [20], [21].

The problem with deep learning-based CF models is that they use all users (or items) in the dataset to
build the latent space that is later used to predict the missing ratings of each user, so users with
different interests contribute to generating the predictions. As a result, the prediction accuracy is reduced,
because the predictions are generated from users with different preferences.

This research proposes an optimized clustering-based denoising autoencoder model. Unlike other deep
learning-based models, it trains multiple models instead of one. Users are clustered by their preferences
using the k-means algorithm combined with a nature-inspired algorithm that determines the optimal initial
centroids, and each cluster trains its own denoising autoencoder model. The rest of this paper is organized
as follows: section 2 discusses the techniques used in the proposed system, section 3 presents the proposed
method, section 4 presents the results, and section 5 discusses the conclusions.


2. THE COMPREHENSIVE THEORETICAL BASIS 

This research proposes an optimized clustering-based denoising autoencoder model (OCB-DAE). It is based on
clustering optimization and autoencoder techniques, which are briefly discussed in the following
sub-sections.


2.1. Clustering optimization 

K-means is a partitioning-based clustering algorithm. It is the simplest and most widely used clustering
algorithm, and it is computationally efficient [22]. K-means divides the data points into K clusters based on
the clusters’ centers, or centroids. The centroid of each cluster is computed as the mean of all data points
in that cluster. Before training the clustering model, the user needs to specify the number of clusters (K) [23].

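Silhouette is one of the methods used to find the optimal number of clusters (K). It validates
the consistency within clusters of data points. The silhouette method measures how similar a data point is
to its own cluster compared to other clusters by computing its silhouette coefficient [24]:


$$\text{coefficient} = \frac{b - a}{\max(a, b)} \qquad (1)$$


where a is the mean distance between the data point and the other members of its own cluster, and b is the
mean distance between the data point and the members of the nearest other cluster.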
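As an illustration of (1), the following minimal sketch computes the silhouette coefficient of a single data point. It assumes the standard definitions of a and b (mean distance to the point's own cluster and to the nearest other cluster) and the Euclidean distance; the function names and the toy data are hypothetical and are not taken from the proposed system.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_coefficient(index, points, labels):
    """Silhouette coefficient (b - a) / max(a, b) of points[index]."""
    own = labels[index]
    # Group the distances from the chosen point by cluster label.
    dists = {}
    for j, (p, lab) in enumerate(zip(points, labels)):
        if j != index:
            dists.setdefault(lab, []).append(euclidean(points[index], p))
    if own not in dists:
        return 0.0  # common convention for a single-member cluster
    a = sum(dists[own]) / len(dists[own])
    b = min(sum(d) / len(d) for lab, d in dists.items() if lab != own)
    return (b - a) / max(a, b)


# Two well-separated toy clusters (illustrative values only).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
score = silhouette_coefficient(0, points, labels)
```

A score close to 1 indicates that the point sits well inside its own cluster; averaging the coefficient over all points gives the score that is compared across candidate values of K.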


To overcome the limitations of the k-means algorithm, especially its sensitivity to the initial centroids,
nature-inspired algorithms (NIA) can be combined with k-means to determine the initial centroids that give
the best clustering results. Used as an optimization technique, an NIA begins by randomly generating
centroids and assigning them as the initial centroids of the k-means algorithm. The sum of squared errors
(SSE) of the k-means output is then calculated using (2), which adds up the squared distances between each
data point and its corresponding centroid. The SSE is used as the fitness measure, and the NIA endeavors to
minimize it to find the best centroids [10].


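$$SSE = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2 \qquad (2)$$

where $C_k$ is the set of data points assigned to cluster k and $\mu_k$ is the centroid of that cluster.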
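
The SSE in (2) can be sketched in a few lines of plain Python. This is only an illustration under the assumption of Euclidean distance; the function name `sse` and the toy values are hypothetical.

```python
def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total


points = [(1.0, 1.0), (1.0, 3.0), (6.0, 2.0)]
labels = [0, 0, 1]                      # cluster index of each point
centroids = [(1.0, 2.0), (6.0, 2.0)]    # mean of each cluster's members
total_error = sse(points, labels, centroids)
```

Here the two points of cluster 0 each lie one unit from their centroid and the single point of cluster 1 coincides with its centroid, so the contributions are 1, 1, and 0.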

2.2. Autoencoders

The autoencoder (AE) was first presented in 1991 by Kramer [25]. It is a deep learning algorithm that
obtains a high-level representation of the original features. Autoencoders use feed-forward neural networks
to learn a compact representation of the input. The output of the AE network attempts to reconstruct the
input, and the reconstruction loss is back-propagated to train the network. The network consists of two
parts:
— Encoder: x → z
— Decoder: z → x̂
In the simplest case there is only one hidden layer, where the encoder takes the input x and maps it to z,
and the decoder maps z to the reconstructed input x̂ [26].

To discover more robust features and to prevent the autoencoder from simply learning the identity
function, Vincent et al. presented the denoising autoencoder (DAE). The DAE takes a corrupted version x̃ of
the input x and trains the network to denoise it and reconstruct the original input x. Many corruption
options can be used, including additive Gaussian noise and multiplicative mask-out/drop-out noise [21].
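
The two corruption options mentioned above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments; the function names, the noise level `sigma`, and the sample vector are assumptions made for the example.

```python
import random


def additive_gaussian_noise(x, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [value + rng.gauss(0.0, sigma) for value in x]


def mask_out_noise(x, ratio, rng):
    """Set each entry to zero independently with probability ratio."""
    return [0.0 if rng.random() < ratio else value for value in x]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
x = [4.0, 3.0, 5.0, 1.0, 2.0]
x_gauss = additive_gaussian_noise(x, 0.1, rng)
x_masked = mask_out_noise(x, 0.5, rng)
```

Mask-out noise leaves the surviving entries unchanged, so the network has to infer the zeroed values from the remaining ones. Additive Gaussian noise perturbs every entry slightly instead.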


3. METHOD 

The general steps of the proposed system are extracting the user-genre matrix, clustering the users, and
building the DAE model, as illustrated in Figure 1. These steps are applied to the MovieLens 1M dataset,
which contains 1,000,209 ratings of approximately 3,900 movies made by 6,040 users. Two tables from this
dataset are used:
— The movies table contains all the movies with their features, which represent the movies’ genres.
— The ratings table contains the ratings (on a scale from 1 to 5) given by users to movies.

Figure 1. General diagram of the proposed system


3.1. Extract user-genre matrix 

The first step of the proposed system is loading the necessary data, namely the movies and ratings
tables, to be used for extracting the users’ preferences. The movies and ratings tables are merged to
produce a single table that contains all the users as rows and all the movies’ genres (18 genres) as
columns. Each cell represents the average rating given by a user to the movies of a genre. Part of the
user-genre matrix is shown in Figure 2.
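
The extraction step can be sketched in plain Python as below. The toy tables, the three-genre list, and the function name are hypothetical. The sketch also assumes that a genre a user never rated is stored as 0.0, which is what the zero entries in Figure 2 suggest.

```python
from collections import defaultdict

# Toy stand-ins for the movies and ratings tables (hypothetical values).
movies = {
    1: ["Animation", "Comedy"],
    2: ["Action"],
    3: ["Action", "Comedy"],
}
ratings = [  # (user_id, movie_id, rating)
    (1, 1, 5), (1, 2, 3), (1, 3, 4),
    (2, 2, 2),
]
genres = ["Action", "Animation", "Comedy"]


def user_genre_matrix(movies, ratings, genres):
    """Average rating per (user, genre); 0.0 where the user rated nothing."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    users = set()
    for user_id, movie_id, rating in ratings:
        users.add(user_id)
        # A movie contributes its rating to every genre it belongs to.
        for genre in movies[movie_id]:
            sums[(user_id, genre)] += rating
            counts[(user_id, genre)] += 1
    return {
        user: [
            sums[(user, g)] / counts[(user, g)] if counts[(user, g)] else 0.0
            for g in genres
        ]
        for user in sorted(users)
    }


matrix = user_genre_matrix(movies, ratings, genres)
```

In this toy case user 1 rated two Action movies with 3 and 4, giving an Action average of 3.5, while user 2 has no Animation or Comedy ratings and therefore gets zeros in those columns.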


Figure 2. Part of the user-genre matrix


user_id  Action    Adventure  Animation  Children's  Comedy    Crime     Documentary  Drama     Fantasy  Film-Noir  Horror    Musical   Mystery   Romance
1        4.200000  4.000000   4.111111   4.250000    4.142857  4.000000  0.000000     4.428571  4.00     0.000000   0.000000  4.285714  0.000000  3.666667
2        3.500000  3.736842   0.000000   0.000000    3.560000  3.583333  0.000000     3.898734  3.00     4.000000   3.000000  0.000000  3.333333  3.708333
3        3.956522  4.000000   4.000000   4.000000    3.766667  0.000000  0.000000     4.000000  4.50     0.000000   2.666667  4.000000  3.000000  3.800000
4        4.157895  3.833333   0.000000   4.000000    0.000000  5.000000  0.000000     4.166667  4.50     0.000000   4.333333  0.000000  0.000000  4.000000
5        2.612903  3.000000   4.000000   3.833333    3.410714  3.285714  3.666667     3.096154  0.00     4.000000   2.800000  3.333333  3.125000  3.100000

3.2. Users clustering

The silhouette method is applied to the user-genre matrix to determine the best K value by using (1).
AFSA is combined with k-means (AFSA-KM) to determine the optimal initial centroids for clustering. In this
step, users with similar interests are grouped together. The similarities among users are computed based on
the average ratings that each user gave to each movie genre in the user-genre matrix. The output of this
step is the cluster that each user belongs to. This information is added as a new column to the ratings
table to be used for training the model. The AFSA-KM algorithm steps are presented in Algorithm 1.


Algorithm 1. AFSA-KM algorithm

Input: user-genre matrix, number of clusters (K), and population size (i)
Output: users’ clusters

Generate random initial K centroids $X_i$ for each fish $AF_i$
Execute k-means on user-genre matrix using $X_i$ as initial centroids
Compute SSE($X_i$) for each $AF_i$ using (2)
Best = min SSE($X_i$)
While (t < Max iteration) do
    For each $AF_i$ do
        Execute Follow Behavior on $X_i^{(t)}$
        Execute Swarm Behavior on $X_i^{(t)}$
        If F($X_i^{(t)}$, follow) < F($X_i^{(t)}$, swarm)
            $X_i^{(t+1)}$ = $X_i^{(t)}$, follow
        Else
            $X_i^{(t+1)}$ = $X_i^{(t)}$, swarm
        End if
    End for
    If F($X_{AF}$) < F(Best)
        Best = $X_{AF}$
    End if
End while
Execute k-means on user-genre matrix using Best as initial centroids
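
The evaluation part of Algorithm 1 (generating candidate centroids, running k-means from each, scoring the result with the SSE, and clustering from the best candidate) can be sketched as below. This is a deliberately simplified stand-in: the follow and swarm behaviors of AFSA are omitted, so the sketch amounts to selecting the best of several random initializations rather than performing the full AFSA search. All function names, the toy data, and the parameter values are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=20):
    """Plain k-means (Lloyd's algorithm) from the given initial centroids."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: mean of the members; empty clusters keep their centroid.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def fitness(points, initial_centroids):
    """SSE of the k-means result obtained from these initial centroids."""
    labels, centroids = kmeans(points, initial_centroids)
    return sum(squared_distance(p, centroids[lab])
               for p, lab in zip(points, labels))


def best_initial_centroids(points, k, population_size, rng):
    """Evaluate a population of random centroid sets and keep the fittest."""
    population = [rng.sample(points, k) for _ in range(population_size)]
    return min(population, key=lambda cand: fitness(points, cand))


rng = random.Random(42)
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
best = best_initial_centroids(points, k=2, population_size=8, rng=rng)
labels, centroids = kmeans(points, best)
```

In the full algorithm each candidate would be moved by the follow and swarm behaviors at every iteration before being re-evaluated, instead of being drawn only once.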


3.3. Build DAE model 

Based on the clustering information extracted in the previous step, the ratings data is divided into
K sets. Each set trains a DAE model with the same model structure. The proposed model is designed as
an item-based model, where the input data is represented as an item-user matrix ($r_i$). Figure 3 shows the
structure of the proposed model, which consists of the following layers:


Figure 3. The structure of the proposed model


3.3.1. Input layer 

The number of nodes in the input layer equals the number of users within a cluster $C_k$. The input
values $r_i$ represent the ratings R given by the users of $C_k$ to an item i,
$r_i = \{R_{1i}, R_{2i}, R_{3i}, \ldots, R_{mi}\}$, where m is the number of users in $C_k$. The input data
is then corrupted into $\tilde{r}_i$ by applying dropout noise, which randomly sets input values to zero
according to a noise ratio.

3.3.2. Hidden layer
The hidden layer represents the latent space of the input data. It has a smaller number of nodes than
the input layer, and they are fully connected to the input data $\tilde{r}_i$. The output of the hidden
layer is z, which is computed by:


$$z = f(W \tilde{r}_i + b) \qquad (3)$$
where f is a non-linear activation function, W is the weights of the hidden nodes, and b is the biases.


3.3.3. Output layer 
The output layer has the same number of nodes as the input layer. They are fully connected to the
hidden layer’s nodes. The output of this layer is the predicted ratings $\hat{r}_i$, which are computed by:


$$\hat{r}_i = f(V z + b) \qquad (4)$$


where f is a linear activation function, V is the weights of the output nodes, and b is the biases.

During training, the model tries to minimize the error (loss) between the actual ratings $r_i$ and the
predicted ratings $\hat{r}_i$ over several epochs. The Adam optimization algorithm [27] is used to update
the model’s weights to reduce the mean squared error (MSE), which is computed using (5). The steps of the
proposed model are illustrated in Algorithm 2.


Algorithm 2. Cluster-based denoising autoencoder RS
Input: Ratings data
Output: Predicted ratings $\hat{r}_i$
k ← number of clusters
i ← 0
While i < k do
    Load the ratings of cluster i
    m ← number of items rated by cluster i users
    n ← number of cluster i users
    Generate m×n matrix $r_i$
    Split $r_i$ into training and testing sets
    $\tilde{r}_i$ ← dropout_noise($r_i$, ratio)
    Build DAE model with input data $\tilde{r}_i$
    While epoch < max do
        Compute z using (3)
        Compute $\hat{r}_i$ using (4)
        Compute MSE between $r_i$ and $\hat{r}_i$ using (5)
        Update parameters using Adam optimizer
    End while
    Generate predictions
    i ← i + 1
End while
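
The inner training loop of Algorithm 2 can be sketched for a single cluster as follows. This is a toy illustration under several stated assumptions: plain per-sample gradient descent is substituted for the Adam optimizer, no regularization or train/test split is applied, and the matrix, layer sizes, and hyperparameters are hypothetical. Each row of `ratings` plays the role of one item's rating vector over the users of a cluster.

```python
import math
import random


def forward(x, W, b, V, c):
    """Eq. (3) with a sigmoid f, then Eq. (4) with a linear f."""
    z = [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, x)) + bj)))
         for row, bj in zip(W, b)]
    r_hat = [sum(w * v for w, v in zip(row, z)) + cj for row, cj in zip(V, c)]
    return z, r_hat


def mse(rows, noisy_rows, params):
    """Mean squared error between clean rows and their reconstructions."""
    total = sum((t - p) ** 2
                for r, x in zip(rows, noisy_rows)
                for t, p in zip(r, forward(x, *params)[1]))
    return total / (len(rows) * len(rows[0]))


def train_dae(rows, hidden, ratio, epochs, lr, rng):
    n = len(rows[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(hidden)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(n)]
    b, c = [0.0] * hidden, [0.0] * n
    # Corrupt the input once with dropout (mask-out) noise, as in Algorithm 2.
    noisy = [[0.0 if rng.random() < ratio else v for v in r] for r in rows]
    params = (W, b, V, c)
    history = [mse(rows, noisy, params)]
    for _ in range(epochs):
        for r, x in zip(rows, noisy):
            z, r_hat = forward(x, W, b, V, c)
            # Gradient of the per-sample MSE with respect to the output.
            d_out = [2.0 * (p - t) / n for p, t in zip(r_hat, r)]
            # Back-propagate through the linear output and sigmoid hidden layer.
            d_hid = [sum(V[i][j] * d_out[i] for i in range(n)) * z[j] * (1.0 - z[j])
                     for j in range(hidden)]
            for i in range(n):
                for j in range(hidden):
                    V[i][j] -= lr * d_out[i] * z[j]
                c[i] -= lr * d_out[i]
            for j in range(hidden):
                for m in range(n):
                    W[j][m] -= lr * d_hid[j] * x[m]
                b[j] -= lr * d_hid[j]
        history.append(mse(rows, noisy, params))
    return params, history


rng = random.Random(0)
ratings = [[5.0, 4.0, 0.0, 1.0], [4.0, 5.0, 1.0, 0.0], [1.0, 0.0, 5.0, 4.0]]
params, history = train_dae(ratings, hidden=3, ratio=0.5,
                            epochs=300, lr=0.02, rng=rng)
```

The `history` list records the reconstruction error before training and after every epoch, so its first and last entries show how far the loss has dropped on this toy input.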


$$MSE = \frac{1}{n} \sum_{i=1}^{n} (r_i - \hat{r}_i)^2 \qquad (5)$$


3.3.4. Model evaluation 
To evaluate the performance of the proposed model, the root mean squared error (RMSE) is used. It is
the square root of the MSE and is computed by:


$$RMSE = \sqrt{MSE} \qquad (6)$$
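
The relation between (5) and (6) can be sketched as below. The sketch assumes the error is averaged over the list of ratings passed in; whether unobserved ratings are excluded from that list is not stated in the text, and the function names and sample values are illustrative only.

```python
import math


def mse(actual, predicted):
    """Mean squared error, as in (5)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, as in (6)."""
    return math.sqrt(mse(actual, predicted))


actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
error = rmse(actual, predicted)
```

For these values the squared errors are 0.25, 0, 1, and 1, so the MSE is 0.5625 and the RMSE is 0.75, which is expressed in the same units as the ratings.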


4. RESULTS AND DISCUSSION 

This section presents the results of users clustering and the evaluation of the proposed model. The
results are presented in charts and tables and are discussed in the following sub-sections.


4.1. Users clustering results 

The best number of clusters for the applied dataset is determined based on the silhouette coefficient
scores using (1). Different numbers of clusters (k) between 2 and 20 are tested. As Figure 4 shows, the
score kept decreasing until k=11, where it rose before decreasing again. Based on this result, the selected
number of clusters is 11.

Figure 4. Silhouette scores of different numbers of clusters


For users clustering, we compared different clustering methods to adopt the best one. In the
comparison we applied k-means alone, ACO with k-means (ACO-KM), PSO with k-means (PSO-KM), and
AFSA-KM. The comparison is based on the SSE scores computed using (2), where a lower score is better. All
methods used the same parameters: the number of clusters is 11, the number of dimensions is 18 (the number
of genres), the population size is 8, and the maximum number of iterations is 100. Table 1 shows the
comparison results, where AFSA-KM achieved the best result in terms of SSE. It is worth mentioning that
using an NIA along with k-means gives better results than using k-means alone.


Table 1. SSE scores of different clustering methods


Algorithm   SSE
k-means     97487.79442
ACO-KM      95840.56913
PSO-KM      95840.47005
AFSA-KM     95840.45985
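
To put the values in Table 1 into perspective, the relative SSE reduction of each hybrid method over plain k-means can be computed directly from the reported scores. The snippet below is simple arithmetic on the table values; the variable names are illustrative.

```python
sse_scores = {
    "k-means": 97487.79442,
    "ACO-KM": 95840.56913,
    "PSO-KM": 95840.47005,
    "AFSA-KM": 95840.45985,
}
baseline = sse_scores["k-means"]
# Percentage by which each hybrid method lowers the SSE of plain k-means.
reduction_pct = {
    name: 100.0 * (baseline - score) / baseline
    for name, score in sse_scores.items()
    if name != "k-means"
}
best_method = max(reduction_pct, key=reduction_pct.get)
```

All three hybrid methods reduce the SSE by roughly 1.69%, and the differences among them are much smaller than their common gain over k-means alone.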


4.2. Model evaluation results 

The experimental results of the proposed OCB-DAE model are computed by taking the average RMSE of all
modeled clusters using (6). The proposed model uses 80% of the dataset for training and 20% for testing.
The input data is corrupted with a dropout noise ratio of 50%, and the number of nodes in the hidden layer
is 256. Sigmoid and linear activation functions are used in the hidden and output layers, respectively, and
the Adam optimizer with a learning rate of 0.0001 is used to update the weights. The results of all modeled
clusters are shown in Table 2. The OCB-DAE model is compared with the AE and DAE models to evaluate its
performance. The RMSE scores and the models’ parameters are compared in Table 3. All the models in Table 3
share the same parameter settings, with the following exceptions. The DAE and OCB-DAE models use a dropout
noise ratio of 0.5 and, for that reason, a smaller regularization rate of 0.001. OCB-DAE has a smaller
number of nodes in the hidden layer (256) because the clustering process leaves it with a smaller number of
nodes in the input layer, and for the same reason it uses a smaller batch size. The sparsity of the training
data of the AE and DAE models is 3.77%, while in OCB-DAE it varies in each cluster, with an average of 3.37%.


Table 2. RMSE of each modeled cluster


Cluster  #Items (Samples)  #Users (Input nodes)  Sparsity (%)   RMSE
0        3,194             742                   4.58           0.676
1        2,715             377                   1.35           0.5033
2        3,268             652                   5.07           0.748
3        2,483             361                   1.71           0.495
4        2,386             375                   1.5            0.5149
5        2,404             370                   2              0.5834
6        2,281             326                   1.87           0.4925
7        3,563             783                   7.47           0.784
8        2,616             510                   2.09           0.6034
9        2,839             542                   3.01           0.6536
10       3,533             1,002                 6.4            0.7441

                           Total: 6,040          Average: 3.37  Average: 0.6180
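
The summary row of Table 2 can be reproduced from the per-cluster values with a short arithmetic check. The lists below simply restate the table columns; the variable names are illustrative.

```python
users = [742, 377, 652, 361, 375, 370, 326, 783, 510, 542, 1002]
sparsity = [4.58, 1.35, 5.07, 1.71, 1.5, 2.0, 1.87, 7.47, 2.09, 3.01, 6.4]
rmse = [0.676, 0.5033, 0.748, 0.495, 0.5149, 0.5834,
        0.4925, 0.784, 0.6034, 0.6536, 0.7441]

total_users = sum(users)                        # every user is in one cluster
mean_sparsity = sum(sparsity) / len(sparsity)   # unweighted mean over clusters
mean_rmse = sum(rmse) / len(rmse)               # unweighted mean over clusters
```

Both averages are unweighted means over the 11 clusters, so a small cluster counts as much as a large one in the reported overall RMSE.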

Table 3. Performance comparison of the proposed model and other models


Model     Batch size  #Hidden nodes  Regularization rate  RMSE
AE        256         500            0.01                 1.0028
DAE       256         500            0.001                0.9129
OCB-DAE   128         256            0.001                0.6180
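
The RMSE values in Table 3 can also be read as relative improvements of OCB-DAE over each baseline. The snippet below is plain arithmetic on the reported scores; the variable names are illustrative.

```python
rmse_scores = {"AE": 1.0028, "DAE": 0.9129, "OCB-DAE": 0.6180}
proposed = rmse_scores["OCB-DAE"]
# Percentage reduction in RMSE achieved by OCB-DAE relative to each baseline.
improvement_pct = {
    model: 100.0 * (score - proposed) / score
    for model, score in rmse_scores.items()
    if model != "OCB-DAE"
}
```

This corresponds to an RMSE reduction of about 38.4% relative to AE and about 32.3% relative to DAE.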


5. CONCLUSION 

This research proposed an optimized clustering-based denoising autoencoder recommender system that
utilizes a nature-inspired algorithm, the artificial fish swarm algorithm, to improve the k-means algorithm
by determining the optimal initial centroids used to divide the users into k clusters based on their similar
interests. Each cluster’s members cooperate to extract the latent space of the users’ ratings by using a
denoising autoencoder model. Using AFSA with k-means improved the clustering results, reducing the SSE from
97487.79 to 95840.46. The proposed model was trained and evaluated with the MovieLens 1M dataset, where 80%
of it was used for training and 20% for testing. The RMSE between the predicted and the actual test data was
0.618, which outperformed other models that use autoencoder and denoising autoencoder models without
clustering.